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Abstract 

Compressed Sensing decoding algorithms can efficiently recover an TV dimensional real- 
valued vector x to within a factor of its best fc-term approximation by taking m = 2k log N/k 
measurements y — <&x. If the sparsity or approximate sparsity level of x were known, then 
this theoretical guarantee would imply quality assurance of the resulting compressed sensing 
estimate. However, because the underlying sparsity of the signal x is unknown, the quality 
of a compressed sensing estimate x using m measurements is not assured. Nevertheless, we 
demonstrate that sharp bounds on the error — x\\ t N can be achieved with almost no ef- 
fort. More precisely, we assume that a maximum number of measurements m is prc-imposed; 
we reserve 41ogp of the original m measurements and compute a sequence of possible esti- 
mates (xj) P . =1 to x from the m — 41ogp remaining measurements; the errors ||a; — Xj\\{N for 
j = l,...,p can then be bounded with high probability. As a consequence, numerical upper 
and lower bounds on the error between x and the best fc-term approximation to x can be 
estimated for p values of k with almost no cost. Our observation has applications outside of 
compressed sensing as well. 
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1 Introduction 

Compressed Sensing (CS) is a fast developing area in applied mathematics, motivated by the 
reality that most data we store and transmit contains far less information than its dimension 
suggests. For example, a one-dimensional slice through the pixels in a typical grayscale image 
will contain segments of smoothly varying intensity, with sharp changes between adjacent pixels 
appearing only at edges in the image. Often this sparsity in information translates into a sparse 
(or approximately sparse) representation of the data with respect to some standard basis; for 
the image example, the basis would be a wavelet of curvelet basis. For such N dimensional 
data vectors that are well approximated by a /c-sparse vector (or a vector that contains at most 
k << N nonzero entries), it is common practice to temporarily store the entire vector, possibly 
with the intent to go back and replace this vector with a smaller dimensional vector encoding 
the location and magnitude of its k significant coefficients. In compressed sensing, one instead 
collects fewer fixed linear measurements of the data to start with, sufficient in number to recover 
the location and numerical value of the k nonzero coordinates at a later time. Finding "good" 
linear measurements, as well as fast, accurate, and simple algorithms for recovering the original 
data from these measurements, are the twofold goals of compressed sensing research today. 

Review of basic CS setup. The data of interest is taken to be a real-valued vector x € R N 
that is unknown, but from which we are allowed up to m < N linear measurements, in the 
form of inner products of x with m vectors Vj £ of our choosing. Letting $ denote the 
m x N matrix whose jth. row is the vector Vj, this is equivalent to saying that we have the 



freedom to choose and store an m x N matrix <£, along with the m-dimensional measurement 
vector y = <J?x. Of course, since $ maps vectors in M> N to vectors in a smaller dimensional space 
IR m , the matrix is not invertible, and we thus have no hope of being able to reconstruct an 
arbitrary N dimensional vector x from such measurements. 

However, if the otherwise unknown vector x is specified to be fc-sparse, and k is fairly small 
compared with N, then there do exist matrices <!> for which y = &x uniquely determines x, 
and allows recovery of x using fast and simple algorithms. It was the interpretation of this 
phenomenon given by Candes and Tao [I], [2], and Donoho [3], that gave rise to compressed 
sensing. In particular, these authors define classes of matrices that possess this property. One 
particularly elegant characterization of this class is via the Restricted Isometry Property (RIP) 
[2|. A matrix with unit normed columns is said to be A;- RIP if all singular values of any k 
column submatrix of $ lie in the interval [1 — 6k, 1 + 5k] for a given constant 5k < 1. With high 
probability, 2/c-RIP is obtained, with 

1 

k = K(m,N) := 2m/ log(N/m)), where m < — m, (1) 

for mx N matrices $ whose entries <&jj are independent realizations of a Gaussian or Bernoulli 
random variable [4 J . Also with high probability, an mx N matrix obtained by selecting m rows 
at random from the N x N discrete Fourier matrix satisfies 2/c-RIP of the same order as Q 
up to an additional log 3 iV factor [28]. In fact, the order of k given by ([TJ is optimal given m 
and N, as shown in [5] using classical results on Gelfand widths of unit balls in 1% . To date, 
there exist no deterministic constructions of RIP matrices of this order. 



Recovering or approximating x. As shown in |21) . the following approximation results 
hold for matrices $ that satisfy 2/c-RIP with constant 62k < V% — 1 ; 

1. If x e R N is k -sparse, then x can be reconstructed from <I> and the measurement vector 
y = <J>x as the solution to the following l\ minimization problem: 

x = £,x(<&,y) := arg min ||.z||i. (2) 

<S>z=y 



2. If x is not fc-sparse, the error between x and the approximation x = £i(&,y) is still 
bounded by 

\\x - x\\ 2 < —^ a k(x)iN, (3) 

where c = 2(1 + <52fc)/(l — 52k), and ak(x) t N := mf| z |<£ | \x — z\\in denotes the best possible 
approximation error in the metric of lp between x and the set of A:-sparse signals in 
~R N . The approximation error o~k(x)iN is realized by the /c-sparse vector Xk £ ^ N that 
corresponds to vector x with all but the k largest entries set to zero, independent of the 
lp norm in the approximation <Jk(x)iN. 

This immediately suggests to use the ^i-minimizer C\ as a means to recover or approximate an 
unknown x with sparsity constraint. Several other decoding algorithms are used as alternatives 
to l\ minimization for recovering a sparse vector x from its image y = 3>x, not because they 
offer better accuracy ( l\ minimization gives optimal approximation bounds when $ satisfies 
RIP), but because they can be faster and easier to implement. For a comprehensive survey on 
compressed sensing decoding algorithms, we refer the reader to |30j . 

Estimating the accuracy of CS estimates. According to the bound O, the quality of 
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a compressed sensing estimate x = A($,y) depends on how well x can be approximated by 
a /c-sparse vector, where the value of k is determined by the number of rows m composing <£. 
While k is assumed to be known and fixed in the compressed sensing literature, no such bound 
is guaranteed for real-world signal models such as vectors x 6 ~R N corresponding to wavelet co- 
efficient sequences of discrete photograph-like images. Thus, the quality of a compressed sensing 
estimate x in general is not guaranteed. 

If the error ||x — x||2/||x||2 incurred by a particular approximation x were observed to be 
large, then decoding could be repeated using a larger number of measurements, perhaps at 
increasing measurement levels {mi,m2,...,m p }, until the error ||x — Xj\ I2/I \x\ (2 corresponding 
to rrij measurements were observed to be sufficiently small. Of course, the errors ||x — ij\\2 and 
\\x — Xj\\2/\\x\\2 are typically not known, as x is unknown. Our main observation is that one 
can apply the Johnson-Lindenstrauss lemma [13] to the set of p points, 

{(x - xi), (x - x 2 ), (x - x p )}. (4) 

In particular, r = O(logp) measurements of x, provided by yq, = ^fx, when ^ is, e.g. a Gaussian 
or Bernoulli random matrix, are sufficient to guarantee that with high probability, 

4/5||y*-*%|| 2 < \\x-xj\\ 2 < 4/3||y*-Wx i || 2 (5) 

and 

1//3 \\y^-^j\\2 < ||a-aj|| 2 < 2 \\m- frsjlh ^ 
I|y*ll2 ~~ 1 1 1 1 2 ~~ lly^lb 

for any p compressed sensing estimates. The equivalences ^ and ^ allow the measurable 
quantities 1 1 y$ — \Mj 1 1 2 and 1 1 y-% — ^Xj \ \ 2 / \ \ \ \ 2 to function as proxies for the unknown quantities 
||x — X7II2 and ||x — XjHa/IMb; these proxies can be used to 

(a) provide tight numerical upper and lower bounds on the errors \\x— %|| 2 and £j 1 1 2 / 1 1 ^ 1 1 2 
at up to p compressed sensing estimates Xj, 

(b) provide estimates of the underlying /c-term approximations ||x — Xfc||2 of x for up to p 
different values of k, and 

(c) return from among a sequence of estimates (xi, x p ) with different initialization parame- 
ters, an estimate x cv = argmin^ | \y^, — ^Xj\ \ 2 having error | |x — x cv \\ 2 that does not exceed 
a small multiplicative factor of the best possible error in the metric of between x and 
an element from the sequence at hand. 

More precisely, all CS decoding algorithms require as input a parameter m corresponding to the 
number of rows in $; some compressed sensing decoding algorithms (such as greedy algorithms) 
require also a parameter k indicating the sparsity level of x, and other algorithms require as 
input a bound 7 on the expected amount of energy in x outside of its significant coefficients. All 
CS decoding algorithms can be symbolically represented by functions of the form A(<£, y, k, 7), 
and we will give examples where each of the parameters m, k, and 7 can be optimized over 
a sequence of estimates (x\, x 2 , x p ) parametrized by increasing hypotheses on each of the 
variables m, k, and 7. 



The estimation procedure described above, although novel in its proposed application, is by 
no means new. Cross validation is a technique used in statistics and learning theory whereby 
a data set is separated into a training/estimation set and a test/cross validation set, and the 
test set is used to prevent overfitting on the training set by estimating underlying noise pa- 
rameters. We will take a set of m measurements of x, and use m — r of these measurements, 
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in a compressed sensing decoding algorithm to return a sequence (xi,x%, ...) of candidate 
approximations to x. The remaining r measurements, ^fx, are then used to identify from among 
this sequence a single approximation x = Xj, along with an estimate of the sparsity level of 
x. The application of cross validation to compressed sensing has been studied by Boufounos, 
Duarte, and Baraniuk in |7 , but in a different context from the present paper, and without the 
mathematical justification of the Johnson Lindenstrauss lemma that we present below. 

2 Preliminary Notation 

Throughout the paper, we will be dealing with large dimensional vectors that have few nonzero 
coefficients. We use the notation 



to indicate that a vector x G R has exactly n nonzero coefficients. 



We will sometimes use the notation a ~ e b as shorthand for the multiplicative relation 



that can be worded as "the quantity a approximates the quantity b to within a multiplicative 
factor of (1 ± e)". Note that the relation ~ e is not symmetric. Properties of the relation a ~ e b 
are listed below; we leave the proofs (which amount to a string of simple inequalities) as an 
exercise for the reader. 

Lemma 2.1. Fix e G (0, 1). 

1. If a,b G satisfy a ~ e b, then 6/[(l + e)(l - e)] ~ e a. 

2. If a,b,c,d G M + satisfy a ~ e b and c ~ e d, then a/c ~5 b/d for parameter 5 = 2e/l — e. 

3. If (ai, a%, a p ) and (&i, 62, b p ) are sequences in M + , and a,j ~ e bj for each 1 < j < p, 
then mim,- aj ~ e minj bj. 

3 Mathematical Foundations 

The Johnson Lindenstrauss (JL) lemma, in its original form, states that any set of p points in high 
dimensional Euclidean space can be embedded into e~ 2 log(p) dimensions, without distorting the 
distance between any two points by more than a factor of (1 ± e) |13| . In the same paper, it 
was shown that a random orthogonal projection would provide such an embedding with positive 
probability. Following several simplifications to the original proof |15j . |12j . |14j . it is now 
understood that Gaussian random matrices, among other purely random matrix constructions, 
can substitute for the random projection in the original proof of Johnson and Lindenstrauss. Of 
the several versions of the lemma now appearing in the literature, the following variant presented 
in Matousek [16] is most applicable to the current presentation. 

Lemma 3.1 (Johnson-Lindenstrauss Lemma). Fix an accuracy parameter e G (0, 1/2], a confi- 
dence parameter 5 G (0, 1), and an integer r > ro = Ce~ 2 log ^. 

Let Ai be a random r x N matrix whose entries M-ij are independent realizations of a ran- 
dom variable R that satisfies: 



x = n 



(7) 




(8) 
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1. Var(R) = 1/r (so that the columns of M. have expected £2 norm 1) 



2. E(R) = 0, 

3. For some fixed a > and for all A, 

Prob[\R\ > A] < 2e- aX2 (9) 

Then for a predetermined x E M^, 

(l-e)||z||^<||A4x||, 5 < (l + e)||a% (10) 
is satisfied with probability exceeding 1 — 5. 



The constant C bounding ro in Lemma (3.1) grows with the parameter a specific to the con- 
struction of Ai Q. Gaussian and Bernoulli random variables R will satisfy the concentration 
inequality (|9| for a relatively small parameter a (as can be verified directly), and for these ma- 



trices one can take C = 8 in Lemma (3.1 ). 



The Johnson Lindenstrauss lemma can be made intuitive with a few observations. Since 
E[i2] = and Var[i?] = -, the random variable 1 1 TWa; 1 1 2 equals \ \x\\\ in expected value; that is, 

E[ 11*1*111] =IW||. (11) 
Additionally, H-MxH 2 . inherits from the random variable R a nice concentration inequality: 

Prob[||.Mx|||- > e||x||l] < e~ a{2e ^ )2 < 5/2. (12) 

The first inequality above is at the heart of the JL lemma; its proof can be found in |16j . The 
second inequality follows using that r > (2ae 2 ) -1 log(|) and e < 1/2 by construction. A bound 



k 2 ' 

similar to (12) holds for Probl H-MxH^ — H^Hl < — e ll x ll2] as wei b an d combining these two 



bounds gives desired result (10) 



For fixed x € 1R , a random matrix J\A constructed according to Lemma (|3.1l) fails to sat- 



isfy the concentration bound (10) with probability at most 5. Applying Boole's inequality, A4 
then fails to satisfy the stated concentration on any of p predetermined points {xj}^ =1 , Xj G M. N , 
with probability at most £ = p5. In fact, a specific value of £ E (0, 1) may be imposed for fixed 
p by setting 5 = £/p. These observations are summarized in the following corollary to Lemma 



Corollary 3.2. Fix an accuracy parameter e E (0, 1/2], a confidence parameter £ E (0, 1), and 
i: - " ■'' " "joints {xj\ V j=\ C Mr. Set 5 = £/p, and fix an integer r > ro = Ce~ 2 log^j = 



Ce 2 log J|. If M is a r x N matrix constructed according to Lemma (3.1 ), then with probability 
> 1 — £, the bound 

(1- e)||^||^ < llMs^r < (1 + 6)11^-11^ (13) 
obtains for each j = 1, 2, ...,p. 



4 Cross Validation in Compressed Sensing 

We return to the situation where we would like to approximate a vector x E M. N with an assumed 
sparsity constraint using m < N linear measurements y = Ax where A is an m x N matrix of 
our choosing. Continuing the discussion in Section 1, we will not reconstruct x in the standard 
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way by x = A(A, y, k, 7) for fixed values of the input parameters, but instead separate the mxN 
matrix A into an n x N implementation matrix and anrxJV cross validation matrix and 
separate the measurements y accordingly into y$ and y^. We use the implementation matrix 
and corresponding measurements y$ as input into the decoding algorithm to obtain a sequence 
of possible estimates (x\, x p ) corresponding to increasing one of the input parameters m, k, or 
7. We reserve the cross validation matrix and measurements y^ to estimate each of the error 
terms \ \x — Xj\\2 in terms of the computable \\y^ — ^fXj\\2- Our main result, which follows from 
Corollary (3.2), details how the number of cross validation measurements r should be chosen in 
terms of the desired accuracy e of estimation, confidence level £ in the prediction, and number 
p of estimates xj to be measured: 

Theorem 4.1. For a given accuracy e € (0,1/2], confidence £ S (0,1), and number p of es- 
timates Xj G Mr, it suffices to allocate r = [~Ce -2 log ^|] rows to a cross validation matrix ^ 
of Gaussian or Bernoulli type, normalized according to Lemma (3.2) and independent of the 
estimates Xj, to obtain with probability greater than or equal to 1 — and for each j = 1, 2, ...,p, 
the bounds 



\x — x 



1 + e 



< 



< 



(14) 



and 



1 - 3e 



\x — x 



(l + 6)(l- e )2 



< 



\ x \\2 



\^(x 



Wx\ 



< 



(1 



and also 



where rj 



1 

1 + e 



< ^ < 



1 

1 - e 



(15) 



(16) 



mm i<j<p If 



n is the unknown oracle error corresponding to the best possible 



approximation to x in the metric ofl^ from the sequence (x\, x p ), and ff^ = minK^xp 1 1 \E' (a; - 



x j)\\q 



Proof. 



is the observable cross validation error. 



• The bounds in ( 14 ) are obtained by application of Lemma (3.2 ) to the p points u 



and rearranging the resulting bounds according to Lemma (2.1 ) part (1). The bound (16) 



follows from the bounds (14) and part (3) of Lemma (2.1) 



x — x 



.1 ' 



• The bounds in (15) are obtained by application of Lemma (3.2) to the p + 1 points no = 
and regrouping the resulting bounds according to part (2) of Lemma (|2.1[) 



□ 

Remark 4.2. The measurements making up the cross validation matrix ^ must be independent 
of the measurements comprising the rows of the implementation matrix <£. This comes from the 
requirement in Lemma (3.1 ) that the matrix ^ be independent of the points Uj = x — Xj. This 



requirement is crucial, as observed when x solves the l\ minimization problem 

x = arg min ||z||i subject to &z = $x, 



(17) 



in which case the constraint <&(x — x) = clearly precludes the rows of $ from giving any 
information about the error II 2; — xlk- 
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Remark 4.3. Theorem (4.1) should be applied with a different level of care depending on 
what information about the sequence (x — x\,x — X2,---,x — x p ) is sought. If the minimizer 
x = argmini<j< p | \ty(x — %)||zj is sufficient for one's purposes, then the precise normalization 
of ^ in Theorem (4.1) is not important. The normalization doesn't matter either for estimating 

1 2- On the other hand, if one is using cross validation 
xA\2, then normalization is absolutely crucial, and 



x 



mi 



■>"■ 



the normalized quantities \\x 
to obtain estimates for the quantities 
one must observe the normalization factor given by Lemma (3.2) that depends on the number 
of rows r allocated to the cross validation matrix \E'. 



5 Applications of cross validation to compressed sensing 
5.1 Estimation of the best &;-term approximation error 

We have already seen that if the m x N matrix satisfies 2/c-RIP with parameter 5 < \[2 — 1, 
and x = &x) is returned as the solution to the t\ minimization problem ([2]), then the error 

between x and the approximation x is bounded by 

c 

\\x — x\\2 < — ^<7fc(a;)jiv. (18) 
yk 1 

Several other decoding algorithms in addition to i\ minimization enjoy the reconstruction guar- 



antee ( 18 ) under similar bounds on <£, such as the Iteratively Reweighted Least Squares algorithm 



(IRLS) [29] , and the greedy algorithms CoSAMP |30j and Subspace Pursuit [31] . It has recently 



been shown |T8J [20] that if the bound (18) is obtained, and if x — x lies in the null space of 
<I> (as is the case for the decoding algorithms just mentioned), then if is a Gaussian or a 
Bernoulli random matrix, the error ||x — x\\2 also satisfies a bound, with high probability on <!>, 
with respect to the residual, namely 

\\x - x\\ 2 < cakix)^, (19) 



for a reasonable constant c depending on the RIP constant 82k of <!>. In the event that (19) 
is obtained, a cross validation estimate \\^{x — x)||r can be used to lower bound the residual 
ak(x)iN, with high probability, according to 



|*(x - x)||^ < ||z - xW^ < ca fc (x)^, (20) 



with 0{\) rows reserved for the matrix ^ (4.1 ). At this point, we will use Corollary 3.2 of [8] 



where it is proved that if the bound (18) holds for x with constant c, then the same bound will 
hold for 

Xk = arg min \\x — z\\in, (21) 

z:\z\<k 2 

the best /c-sparse approximation to x, with constant c = 3c. Thus, we may assume without loss 
of generality that x, is fc-sparse, in which case | \^{x — £)||n; also provides an upper bound on the 
residual (Jfc(x) z jv by 

(l + e)||*0z-x)||jr > ||x-x||^ >a k {x\N. (22) 

With almost no effort then, cross validation can be incorporated into many decoding algorithms 
to obtain tight upper and lower bounds on the unknown ^-sparse approximation error Ok{x\N 
of x. More generally, the allocation of 10 log p measurements to the cross validation matrix 
is sufficient to estimate the errors (||x — a^fc^ 1 1 2)^=1 or the normalized approximation errors 
(||x — x/ Cj ||2/||x||2)j =1 at p sparsity levels kj by decoding p times, adding rrij measurements to the 
implementation matrix $j at each repetition. Recall that the quantities kj and rrij are related 
by kj = 2m,j / \og{N /rrij)) according to ([T|. 
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5.2 Choice of the number of measurements m 



Photograph-like images have wavelet or curvelet coefficient sequences x 6 M. N that are com- 
pressible [23j [32 j , having entries that obey a power law decay 

M(fc) < c s k~ s , (23) 

where denotes the kth largest coefficient of x in absolute value, the parameter s > 1 indicates 
the level of compressibility of the underlying image, and c s is a constant that depends only on s 



and the normalization of x. From the definition (23), compressible signals are immediately seen 
to satisfy 

\\x-x k \\ 1 /Vk<c' s k- s+1 / 2 , (24) 

so that the solution x m = to the l\ minimization problem ^ using an m x N matrix 

of optimal RIP order k = 2m/ log (iV/m)) satisfies 

\\x-x m \\ 2 < c S)5 k- s+1 ' 2 . (25) 

The number of measurements m needed to obtain an estimate x m satisfying \ \x — x m \\2 < r for 
a predetermined threshold r will vary according to the compressibility of the image at hand. 
Armed with a total of m measurements, the following decoding method that adaptively chooses 
the number of measurements for a given signal x presents a more democratic alternative to 
standard compressed sensing decoding structure: 

Table 1: CS decoding structure with adaptive number of measurements 



1. Input: The m-dimensional vector y = <I>x, the m x N matrix <&, (in some algorithms) the sparsity 
level k, and (again, in some algorithms) a bound 7 on the noise level of x, the number p of of row 
subsets of <!>, (4>i, $2, & p ), corresponding to increasing number of rows mi < m 2 < ... < m p < m, 
and threshold r > 0. 

2. Initialize the decoding algorithm at j = 1. 

3. Estimate Xj — A($ TOj , j/ m ^ , fc, 7) with the decoder A at hand, using only the first rrij measurement 
rows of The previous estimate Xj—i can be used for "warm initialization" of the algorithm, 
if applicable. The remaining rj — m — mj measurement rows are allocated to a cross validation 
matrix ^ that is used to estimate the resulting error ||x — & j ] 1 2 / ] 1 3^ | J 2 - 

4. Increment j by 1, and iterate from step 3 if stopping rule is not satisfied. 

5. Stop: at index j = j* < p if ||x— x mj ||2/||x||2 < t holds with near certainty, as indicated by 

|*(x-x ro J|| 2 /||$x|| 2 



y/r] - 3 log p 



< t (26) 



according to Theorem (4.1). If the maximal number of decoding measurements m p < m have been 



used at iteration p, and ( 26 1 indicates that ||x — implh/IMh > t still, return x m = A(<6, y, k, 7) 
using all m measurements, but with a warning that the underlying image x is probably too dense, 
and its reconstruction is not trustworthy. 



5.3 Choice of regularization parameter in homotopy-type algorithms 

Certain compressed sensing decoding algorithms iterate through a sequence of intermediate esti- 
mates Xj that could be potential optimal solutions to x under certain reconstruction parameter 
choices. This is the case for greedy and homotopy-continuation based algorithms. In this section, 
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we study the application of cross validation to the intermediate estimates of decoding algorithms 
of homotopy-continuation type. 



LASSO is the name coined in (??) for the problem of minimizing of the following convex pro- 
gram: 

jgM = arg min \\$>x — ^z\\i 2 + t||z||i (27) 



The two terms in the LASSO optimization problem (27) enforce data fidelity and sparsity, re- 



spectively, as balanced by the regularization parameter r. In general, choosing an appropriate 



value for r in (27) is a hard problem; when $ is an underdetermined matrix, as is the case in 
compressed sensing, the function /(r) = ||x — z^||2 is unknown to the user but is seen empiri- 
cally to have a minimum at a value of r in the interval [0, | \$>x\ |oo] that depends on the unknown 
noise level and/or and compressibility level of x. 

The homotopy continuation algorithm [26], which can be viewed as the appropriate variant 



of LARS [26 , is one of many algorithms for solving the LASSO problem (27) at a predeter- 



mined value of r; it proceeds by first initializing r' to a value sufficiently large to ensure that the 



l\ penalization term in (27) completely dominates the minimization problem and x^ T > = triv- 
ially. The homotopy continuation algorithm goes on to generate x' r 1 =0 for decreasing t' until 
the desired level for r is reached. If r = 0, then the homotopy method traces through the entire 
solution path x^ 1 G M. N for r' > before reaching the final algorithm output x^ = £\(&,y) 
corresponding to the l\ minimizer ([2]). 



From the non-smooth optimality conditions for the convex functional (27), it can be shown 



that the solution path x<- r ' G M is a piecewise-affine function of r [26], with "kinks" possible 



only at a finite number of points r G {t\,T2, ...}. Theorem (4.1) suggests a method whereby 
an appropriate value of r* can be chosen from among a subsequence of the kinks (tx,T2, ■■■,t p ) 
by solving the minimization problem x' r *' = argminj< p ||\I/(x — x^')!^ for appropriate cross 
validation matrix ty. Moreover, since the solution x — zM for tj < r < Tj+i is restricted to 
lie in the two-dimensional subspace spanned by x — x^ and x — x^ +1 \ one can combine the 
Johnson Lindenstrauss Lemma with a covering argument analogous to that used to derive the 
RIP property for Gaussian and Bernoulli random matrices in [3], to cross validate the entire 
continuum of solutions between t\ < r < r p . More precisely, the following bound holds 



under the conditions outlined in Theorem (4.1), with the exception that 2r (as opposed to r) 
measurements are reserved to VP: 

1 < min T1 < r < rp ||x-xt r ]|| 2 < 
1 + e - min Tl < T < T J|^(x-xM)|| 2 " 1-e 



Unfortunately, the bound (28) is not strong enough to provably evaluate the entire solution 
path xM for r > 0, because the best upper bound on the number of kinks on a generic LASSO 
solution path can be very large. One can prove that this number is bounded by 3^, by observ- 
ing that if xt Tl ] and x' T2 ' have the same sign pattern, then x^' also has the same sign pattern 
for T\ < t < T2- Applying Theorem (4.1) to p = 3 N points x — Xj, this suggests that O(N) 



rows would need to be allocated to a cross validation matrix \P in order for Theorem (4.1) and 



the corollary (28) to apply to the entire solution path, which clearly defeats the compressed 
sensing purpose. However, whenever the matrix $ is an m x N compressed sensing matrix 
of random Gaussian, Bernoulli, or partial Fourier construction, it is observed empirically that 
the number of kinks along a homotopy solution path is bounded by 3m, independent of the 
underlying vector x G M. N used to generate the path. This suggests, at least heuristically, that 
the allocation of O(logm) out of m compressed sensing measurements of this type suffices to 
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ensure that the error \\x — |M H2 for the solution £' T ' = argmin T >o \\^(x — x^)||2 will be within 
a small multiplicative factor of the best possible error in the metric of obtainable by any 
approximant iM along the solution curve r > 0. At the value of r corresponding to x' T ', the 



LASSO solution (27) can be computed using all m measurements $ as a final approximation to x. 



The Dantzig selector (DS) [22] refers to a minimization problem that is similar in form to 
the LASSO problem: 

x T = arg mm \\&x — QzWi^ + r||z||i (29) 



The difference between the DS (29) and LASSO (27) is the choice of norm (i^ versus £2) on the 



fidelity-promoting term. Homotopy-continuation based algorithms have also been developed to 
solve the minimization problem (29) by tracing through the solution path x T i for r' > r. As the 



minimization problem (29) can be reformulated as a linear program, its solution path x T £ 



is seen to be a piecewise constant function of r, in contrast to the LASSO solution path. In 
practice, the total number of breakpoints (n, T2, ...) in the domain < r is observed to be on the 
same order of magnitude as m when the mx N matrix $ satisfies RIP [2jj; thus, the procedure 
just described to cross validate the LASSO solution path can be adapted to cross validate the 



solution path of (29) as well. 



Thus far we have not discussed the possibility of using cross validation as a stopping crite- 



rion for homotopy-type decoding algorithms. Along the LARS homotopy curve (27), most of 
the breakpoints (ti,T2, ...) appear only near the end of the curve in a very small neighborhood of 
t = 0. These breakpoints incur only miniscule changes in the error | \x — x Tj 1 12 even though they 
account for most of the computational expense of the LARS decoding algorithm. Therefore, it 
would be interesting to adapt such algorithms, perhaps using cross validation, to stop once r* 
is reached for which the error ||x — £ r *||2 is sensed to be sufficiently small. 



5.4 Choice of sparsity parameter in greedy-type algorithms 

Greedy compressed sensing decoding algorithms also iterate through a sequence of intermediate 
estimates Xj that could be potential optimal solutions to x under certain reconstruction param- 
eter choices. Orthogonal Matching Pursuit (OMP), which can be viewed as the prototypical 
greedy algorithm in compressed sensing, picks columns from the implementation matrix one 
at a time in a greedy fashion until, after k iterations, the fc-sparse vector Xk, a linear combination 
of the k columns of chosen in the successive iteration steps, is returned as an approximation to 
x. The OMP algorithm is listed in Table 2. Although we will not describe the algorithm in full 
detail, a comprehensive study of OMP can be found in [Bj . Note in particular that OMP requires 
as input a parameter k corresponding to the expected sparsity level for x 6 Mr. Such input is 
typical among greedy algorithms in compressive sensing (in particular, we refer the reader to 
|30] , (29] , and [31] ) . As shown in [6] , OMP will recover with high probability a vector x having at 
most k < m/log{N) nonzero coordinates from its image <£x if $ is a (known) mx N Gaussian or 
Bernoulli matrix with high probability. Over the more general class of vectors x = Xd + M that 
can be decomposed into a (i-sparse vector Xd (with d presumably less than or equal to k) and 
additive noise vector M, we might expect an intermediate estimate x s to be a better estimate 
to x than the final OMP output x^, at least when d « k. Assuming that the signal x admits a 
decomposition of the form x = Xd + J\f, the sequence of intermediate estimates (xi, x^) of an 
OMP algorithm can be cross validated in order to estimate the noise level and recover a better 
approximation to x. We will study this particular application of cross validation in more detail 
below. 
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Table 2: Orthogonal Matching Pursuit Basic Structure 



1. Input: The m-dimensional vector y = Bx, the m x N encoding matrix <!> whose j th column is 
labeled <pj, and the sparsity bound k. 

2. Initialize the decoding algorithm at j = 1, the residual ro = y, and the index set Ao = 0. 

3. Estimate 

(a) Find an index Aj that realizes the bound (<& T rj_i),v = ||<i> T r :( _ 1 || 00 . 

(b) Update the index set Aj = Aj_i U Aj and the submatrix of contributing columns: $j = 

'9i I, ©A,: 

(c) Update the residual: 

Sj = argmm ll^-a; - y\\ 2 = ($f$ j )- 1 $jy, 
rj = rj-i-aj. 

(d) The estimate Xj for the signal has nonzero indices at the components listed in Aj, and the 
value of the estimate Xj in component Aj equals the ith component of sj. 

4. Increment j by 1 and iterate from step 3, if j < k. 

5. Stop: at j = k. Output x omp — Xk as approximation to x. 



6 Orthogonal Matching Pursuit: A case study 

As detailed in Table 2, a single index Xj is added to a set Aj estimated as the j most significant 
coefficients of x at each iteration j of OMP; following the selection of Aj, an estimate ctj to x is 
determined by the least squares solution, 

Xj = arg min — y||2i (30) 

supp(z)eAj 

among the subspace of vectors z E M. N having nonzero coordinates in the index set Aj. OMP 
continues as such, adding a single index Aj to the set Aj at iteration j, until j = k at which 
point the algorithm terminates and returns the £;-sparse vector x omp = xt as approximation to x. 

Suppose x has only d significant coordinates. If d could be specified beforehand, then the 
estimate xj at iteration j = d of OMP would be returned as an approximation to x. However, 
the sparsity d is not known in advance, and k will instead be an upper bound on d. As the 
estimate Xj in OMP can be then identified with the hypothesis that x has j significant coordi- 
nates, the application of cross-validation as described in the previous section applies in a very 



natural way to OMP. In particular, we expect x or and x cv of Theorem (4.1) to be close to the 
estimate Xj at index j = \x\ corresponding to the true sparsity of x; furthermore, in the case 
that | a; | is significantly less than k, we expect the cross validation estimate x cv to be a better 
approximation to x than the OMP-returned estimate xt- We will put this intuition to the test 
in the following numerical experiment. 
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6.1 Experimental setup 

We initialize a signal xq of length N = 3600 and sparsity level d = 100 as 

, f 1, forj = 1...100 , 01 , 
X ^ = {o, else <™ 

Noise is then added to x a = xq + M a in the form of a Gaussian random variable N a distributed 
according to 

J\f a ~ N(0, .05), (32) 

and the resulting vector x a is renormalized to satisfy H^alljw = 1- This yields an expected noise 
level of 

E{a d {x a )) « .284. (33) 

We fix the input k = 200 in Table 2, and assume we have a total number of compressed sensing 
measurements m = 800. A number r of these m measurements are allotted to cross validation, 
while the remaining n = m — r measurements are allocated as input to the OMP algorithm in 



Table 2. This experiment aims to numerically verify Theorem (4.1); to this end, we specify a 
confidence £ = 1/100, and solve for the accuracy e according to the relation r = e~ 2 log(^); 
that is, 



Note that the specification (34) corresponds to setting the constant C = 1 in Theorem (4.1) 



Although C > 8 is needed for the proof of the Johnson Lindenstrauss lemma at present, we find 



that in practice C = 1 already upper bounds the optimal constant needed for Theorem (4.1 ) for 
Gaussian and Bernoulli random ensembles. 

A single (properly normalized) Gaussian n x N measurement matrix <]? is generated (recall 
that n = m - r) , and this matrix and the measurements y = $x are provided as input to the 
OMP algorithm; the resulting sequence of estimates (xi,&2, is stored. The final estimate 

x k from this sequence is the returned OMP estimate x omp to x. The error fjo^p — || x orn p ^||2 
is greater than or equal to the oracle error of the sequence, rj or = min^ ||a; — Xj\\2- 

With the sequence (xi,X2, at hand, we consider 1000 realizations $ ? of an r x JV cross 

validation matrix having the same componentwise distribution as but normalized to have 



variance 1/r according to Theorem (3.1). The cross validation error 



rj cv (q) = mm \\^ q (x - Xj)\\ir (35) 

is measured at each realization *S> q ; we plot the average ff^ v of these 1000 values and intervals 
centered at fj^ v having length equal to twice the empirical standard deviation. Note that we are 
effectively testing 1000 trials of OMP-CV, the algorithm which modifies OMP to incorporate 
cross validation so that (x cv ,ff^ v ) are output instead of x omp = Xk- 



At the specified value of £, Theorem ( |4.1[ ) part ( |15| ) (with constant C = 1) implies that 

(l - e)r] or < rfc v (q) < (l + e)r] or (36) 
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should obtain on at least 990 of the 1000 estimates rj cv (q); in other words, at least 990 of the 
1000 discrepancies \i] or — ffc v (q)\ should be bounded by 

< \fkv(q) - r] or \ < er\ or . (37) 



Using the relation (34) between e and r, this bound becomes tighter as the number r of CV 
measurements increases; however, at the same time, the oracle error r] or increases with r for 
fixed m as fewer measurements n = m — r are input to OMP. An ideal number r of CV mea- 
surements should not be too large or too small; Figure 1 suggests that setting aside just enough 



measurements r such that e < .6 is satisfied in (34) serves as a good heuristic to choose the 
number of cross validation measurements (in Figure 1, e < .6 is satisfied by taking only r = 30 
measurements) . 



We indicate the theoretical bound (36) with dark gray in Figure 1, which is compared to the 



interval in light gray of the 990 values of r) cv (q) that are closest to iq or in actuality. 

This experiment is run for several values of r within the interval [5,90]; the results are plotted 
in Figure 1(a), with the particular range r E [5, 30] blown up in Figure 1(b). 

We have also carried out this experiment with a smaller noise variance; i.e. x b = xo + Mb 
is subject to additive noise 

M b ~ JV(0, .02). (38) 
The signal Xb is again renormalized to satisfy HxjUjjv = 1; it now has an expected noise level of 

E{a d {x b )) « .116. (39) 
The results of this experiment are plotted in Figure 1(c). 



6.2 Experimental Results 

1. We remind the reader that the cross-validation estimates ffa, are observable to the user, 
while the values of rj omp , r] or , along with the noise level o~d{x), are not available to the user. 



Nevertheless, rj cv can serve as a proxy for r\ or according to (36), and this is verified by the 
plots in Figure 1. fj^ v can also provide an upper bound on cr^(x), as is detailed in Section 
5.1. 



2. The theoretical bound (36) is seen to be tight, when compared with the observed concen- 



tration bounds in Figure 1. 

With high probability, the estimate x cv (15) using r = 15 out of the alloted m = 800 
measurements will be a better estimate of x than the OMP estimate: ||x c „(15) — x\\2 < 
||5 ; omp(15) — 1 1 2 - With overwhelming probability, the estimate x c „(30) will result in error 
||x cl) (30) — x\\2 < ||£ O mp(30) — x\\2- We note that the estimates f c „(15) and x ct) (30) 



correspond to accuracy parameters e(15) = .8405 and e(30) = .5943 in (34), indicating 
that e < .6 is a good heuristic to determine when enough CV measurements have been 
reserved. 

4. The OMP-CV estimate x cv will have more pronounced improvement over the OMP es- 
timate Xomp when there is larger discrepancy between the true sparsity d of xq and the 
upper bound k used by OMP (in Figure (l),d= 100 and k = 200). In contrast, OMP-CV 
will not outperform OMP in approximation accuracy when d is close to k; however, the 



multiplicative relation (36) guarantees that OMP-CV will not underperform OMP, either. 
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Figure 1: Comparison of the reconstruction algorithms OMP and OMP-CV. We fix the parameters 
N = 3600, to = 800, k = 200, and underlying sparsity d = 100, but vary the number r of the total to 
measurements reserved for cross validation from 5 to 90, using the remaining n = m — r measurements for 
training. The underlying signal has residual <Jioq(x) ss .284 in Figures 1(a) and 1(b), and aioo(x) s» .116 
in Figure 1(c), as shown for reference by the thin horizontal line. In both cases, the OMP-CV error 
rj cv (the solid black line with error bars; each point represents the average of 1000 trials) gives a better 
approximation to the residual error than does OMP (dot-dashed line) with very high probability, even 
when as few as 20 of the total 800 measurements are used for cross validation. Even though r\ cv is 
guaranteed to provide a tighter bound for r\ or as the number r of CV measurements increases, at the same 
time, the oracle error r\ or becomes a worse indicator of the residual o\q^(x) because fewer measurements 
n = m — r arc input to OMP. 
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7 Beyond Compressed Sensing 



The Compressed Sensing setup can be viewed within the more general class of underdeter- 
mined linear inverse problems, in which x 6 M. N is to be reconstructed from a known m x N 
underdetermined matrix A and lower dimensional vector y = Ax using a decoding algorithm 
A : R m — ► R N ; in this broader context, A is given to the user, but not necessarily specified by 
the user as in compressed sensing. In many cases, a prior assumption of sparsity is imposed 
on x, and an iterative decoding algorithm such as LASSO (27) will be used to reconstruct x 
from y [T7] . If it is possible to take on the order of r = log p additional measurements of x by 
an r x N matrix satisfying the conditions of Lemma (3.1), then all of the analysis presented 
in this paper applies to this more general setting. In particular, the error llx — xJI/jv at up to 

2 

j < p successive approximations Xj of the decoding algorithm A may be bounded from below 
and above using the quantities ||\l/(a; — and the final approximation x to x can be chosen 

from among the entire sequence of estimates Xj as outlined in Theorem (4.1 ); an earlier estimate 
Xj may approximate x better than a final estimate x p which contains the artifacts of parameter 
overfitting occurring at later stages of iteration. 



8 Extensions and Open Problems 

We have presented an alternative approach to compressed sensing in which a certain number r 
of the m allowed measurements of a signal x £ 1BL N are reserved to track the error in decoding by 
the remaining m — r measurements, allowing us to choose a best approximation to x in the metric 
of £2 ou t °f a sequence of p estimates (xj)^ =l , and estimate the error between x and its best 
approximation by a fc-sparse vector, again with respect to the metric of ■ We detailed how the 
number r of such measurements should be chosen in terms of desired accuracy e of estimation, 
confidence level £ in the prediction, and number p of decoding iterations to be measured; in 
general, r = 0(log(p)) measurements suffice. Several important issues remain unresolved; we 
mention only a few below. 

1. The cross validation technique promoted in this paper corresponds specifically to the 
technique of holdout cross validation in statistics, where a data set is partitioned into a 
single training and cross validation set (as a rule of thumb, the cross validation set is 
usually taken to be less than or equal to a third of the size of the training set; in the the 
current paper, we have shown that the Johnson Lindenstrauss lemma provides a theoretical 
justification of how many, or, more precisely, how few, cross validation measurements are 
needed in the context of compressed sensing). Other forms of cross validation, such as 
repeated random subsampling cross validation or K-fold cross validation, remain to be 
analyzed in the context of compressed sensing. The former technique corresponds to 
repeated application of holdout cross validation, with r cross validation measurements out 
of the total m measurements chosen by random selection at each application. The results 
are then averaged (or otherwise combined) to produce a single estimation. The latter 
technique, -fT-fold cross validation, also corresponds to repeated application of holdout 
cross validation. In this case, the m measurements are partitioned into K subsets of equal 
size r, and cross-validation is repeated exactly K times with each of the K subsets of 
measurements used exactly once as the validation set. The K results are again combined 



to produce a single estimation. Although Theorem (4.1) does not directly apply to these 



cross validation models, the experimental results of Section 6 suggest that, equiped with 



an m x N matrix satisfying the requirements of Lemma (3.1), the application of K fold 



cross validation to subsets of the measurements of size r << m — r just large enough 



that e > in Theorem (4.1) for fixed accuracy £ and constant C = 1 can be combined to 
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accurately approximate the underlying signal with near certainty. 



2. It is not clear that the analysis in Theorem (4.1 ) can be extended to the noisy compressed 
sensing model, 

y = $x+N, (40) 

where M ~ N(0,a 2 ) is a Gaussian random variable that accounts for both noise and 
quantization error on the measurements Because measurement noise and quantization 
error are unavoidable in any real-world sensing device, any proposed compressed sensing 



technique should extend to the model (40). Indeed, cross validation is studied in [7J in 
this context as a stopping criterion for decoding algorithms of homotopy /greedy type, 
in the case that x is truly sparse and M is Gaussian noise. The experimental results 
in [7J indicate that cross validation works well in this setting, but it remains to provide 
theoretical justification of these results. 

3. We have only considered cross validation over the metric of £2- However, the error — 
x\ \gN, or root mean squared error, is just one of several metrics used in image processing for 

analyzing the quality of a reconstruction x 6 M. to a (known) image x G Mr. In fact, the 
?l reconstruction error \\x — x\\^n has been argued to outperform the root mean squared 



error as an indicator of reconstruction quality [23]. Unfortunately, Theorem (4.1) cannot 
be extended to the metric of l\ , as there exists no i\ analog of the Johnson Lindenstrauss 
Lemma |27] . However, it remains to understand the extent to which cross validation in 
compressed sensing can be applied over a broader class of image reconstruction metrics, 
perhaps using more refined techniques than those considered in this paper. 

4. Many more compressed sensing matrices than just those satisfying the requirements of 



Lemma (3.1) are observed empirically to satisfy the Johnson Lindenstrauss Lemma; in 



particular, a properly normalized r x N matrix obtained by selecting r rows at random 



from the N x N discrete Fourier matrix is observed to satisfy (3.1 ) for number of measure- 
ments r > e -2 log ^j, and the empirical concentration bounds for these matrices appear to 
be indistinguishable from those of (properly normalized) random Gaussian and Bernoulli 
matrices of the same dimension. This suggests that cross validation should be considered 
as a general technique that can be applied to a set of m compressed sensing measurements, 
the theoretical justification of which is a very interesting problem that we hope to pursue 
in the future. 
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